Optical emissivity dataset of multi-material heterogeneous designs generated with automated figure extraction

Optical device design is typically an iterative optimization process based on a good initial guess from prior reports. Databases of optical properties are useful in this process but difficult to compile, because building them requires finding relevant papers and manually converting graphical emissivity curves into data tables. Here, we present two contributions: a dataset of thermal emissivity records with design-related parameters, and a software tool for automated extraction of colored curve data from scientific plots. We manually collected 64 papers with 176 figures reporting thermal emissivity and automatically retrieved 153 colored curve data records. The automated figure analysis pipeline uses Faster R-CNN for axes and legend object detection, EasyOCR for axes numbering recognition, and k-means clustering for colored curve retrieval. Additionally, we manually extracted geometry, materials, and method information from the text to add the necessary metadata to each emissivity curve. Finally, we analyzed the dataset to determine the dominant classes of emissivity curves and to identify the design parameters that lead to each type of emissivity profile.

Automated retrieval of design-related parameters from text
Automatic tools were insufficient to obtain the desired design-related attributes from the text. Figure captions and in-text descriptions referring to figures were easy to locate but contained little data. To examine how much information we could automatically retrieve from them, we processed the collected captions and descriptions with the Lawrence Berkeley National Lab Natural Language Processing (LBNLP) package [1]. LBNLP includes standard text mining tools for materials science and chemistry and uses a pre-trained inorganic materials model [2] for Named Entity Recognition [3] via a long short-term memory neural network [4]. The algorithm identifies materials, properties, applications, phases, structure descriptors, and synthesis and characterization methods, assigning the corresponding tags to the words. However, LBNLP is not specific to the optics domain. Applied to our data, LBNLP highlighted descriptors quite broadly, which would complicate future studies. The materials and geometries found by LBNLP differed significantly from those found by the manual check of each record. For example, the automatic extraction mistakenly identified a silicon carbide film as the most used design. Nevertheless, this approach did highlight tungsten and tantalum slabs with a 2D array of cylindrical cavities on the surface.

Algorithmic approach to axes regions identification
We used computer vision algorithms implemented in the OpenCV [5] package to localize axes lines on scientific plots. We tested two approaches. The first was Canny edge detection [6] combined with polygon approximation [5]. It detected the axes box by drawing a rectangle on top of the four axes lines (all of which had to be present on the plot for this approach to work). It located the axes box on 63% of figures, and all of the detections were correct (although sometimes slightly displaced). The second approach used Canny edge detection combined with the probabilistic Hough line transform [7]. It detected each axis line separately, aiming to locate the x-axis and y-axis lines. This approach found lines on 92% of figures from our set, but it made mistakes such as confusing grid lines with axes lines. We then applied both approaches sequentially and correctly found axes lines for 95% of figures in the dataset. The localization of axis lines required a large amount of custom code, and this did not yet cover the detection of numbering and ticks. This approach is also unreliable, as new images will likely raise new issues to handle. Overall, the performance of the traditional methods suggested that they would not be robust to future changes.

Ticks location during the automated axes scale parsing
During the development of the automated axis scale pipeline, we used the EasyOCR [8] package for axes numbering detection and recognition and assumed that ticks were located at the center of the detected number box. This assumption is valid if two conditions are satisfied: (i) EasyOCR must correctly localize the number region with a tight box; (ii) ticks must be centered on the numbers. A manual check showed that both conditions held in most cases. Figure 1 shows some examples of axes scale parsing, with the original axes region and the result of automated axes detection next to each other. The assumed (green) ticks closely match the original ticks.
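The tick-at-box-center assumption can be illustrated with a minimal sketch. It assumes detections in the form of a four-corner bounding box plus recognized text (the shape EasyOCR returns, here hard-coded with hypothetical values) and a linear axis; the function names `box_center` and `fit_axis_scale` are illustrative and not part of the pipeline described above.

```python
import numpy as np

def box_center(bbox):
    """Return the (x, y) center of a four-corner bounding box."""
    pts = np.asarray(bbox, dtype=float)
    return pts.mean(axis=0)

def fit_axis_scale(detections, axis="x"):
    """Fit value = a * pixel + b from OCR number boxes along one axis.

    detections: list of (bbox, text) pairs; bbox is four (x, y) corners.
    Boxes whose text cannot be parsed as a number are skipped.
    Assumes a linear (not logarithmic) axis.
    """
    idx = 0 if axis == "x" else 1
    pixels, values = [], []
    for bbox, text in detections:
        try:
            value = float(text)
        except ValueError:
            continue  # not a tick label, e.g. an axis title
        # Assumption from the text: the tick sits at the center of the box.
        pixels.append(box_center(bbox)[idx])
        values.append(value)
    if len(pixels) < 2:
        raise ValueError("need at least two numeric labels to fit a scale")
    a, b = np.polyfit(pixels, values, deg=1)
    return a, b

# Hypothetical x-axis detections: box centers at 100, 200, 300 px.
detections = [
    ([[90, 400], [110, 400], [110, 415], [90, 415]], "2"),
    ([[190, 400], [210, 400], [210, 415], [190, 415]], "4"),
    ([[290, 400], [310, 400], [310, 415], [290, 415]], "6"),
    ([[180, 430], [320, 430], [320, 445], [180, 445]], "Wavelength"),
]
a, b = fit_axis_scale(detections, axis="x")
value_at_250 = a * 250 + b  # data value at pixel column 250
```

With the fitted coefficients, any pixel coordinate of an extracted curve point can be converted to a data value; a loose OCR box shifts the center and therefore biases the fit, which is why condition (i) matters.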

Image color decomposition: other methods and our approach
We noticed that most of the existing solutions, such as the Color Thief [9] tool, the Scikit-learn [10] k-means package, and Dominant Color Detection [11], missed some of the colors. Figure 2 shows the incomplete palettes produced by each of the listed methods. In addition, the palette changed every time we applied a given method. We assumed that the issue was caused by the random initialization of the color centers: our images are mostly white, so a single random set of initial centers is statistically unlikely to cover the full diversity of colors. We therefore adjusted the initialization of the k-means algorithm, forcing it to start with eight color cluster centers representing a combination of the RGB and CMYK modes: white, red, green, blue, cyan, magenta, yellow, and black. Then, we iteratively updated the palette, checking the distance between every pixel and the color cluster centers (L2 norm in RGB space). We also allowed empty clusters to be dropped. This modification resulted in a correct and stable set of color centers for each image. Figure 2 shows the palette obtained with our algorithm.
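The modified initialization can be sketched as follows. This is a minimal NumPy illustration of k-means with fixed starting centers and empty-cluster dropping, not the authors' implementation; the function name `seeded_kmeans_palette`, the iteration limit, and the tolerance are assumptions.

```python
import numpy as np

# Eight fixed starting centers (RGB, 0-255), in the order given in the text:
# white, red, green, blue, cyan, magenta, yellow, black.
INITIAL_CENTERS = np.array([
    [255, 255, 255], [255, 0, 0], [0, 255, 0], [0, 0, 255],
    [0, 255, 255], [255, 0, 255], [255, 255, 0], [0, 0, 0],
], dtype=float)

def seeded_kmeans_palette(pixels, centers=INITIAL_CENTERS, max_iter=50, tol=1e-3):
    """Cluster RGB pixels starting from fixed centers; drop empty clusters."""
    pixels = np.asarray(pixels, dtype=float).reshape(-1, 3)
    centers = np.array(centers, dtype=float)  # copy, so the default is untouched
    for _ in range(max_iter):
        # L2 distance in RGB space between every pixel and every center.
        dist = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        kept = np.unique(labels)  # only clusters that own at least one pixel
        new_centers = np.array([pixels[labels == k].mean(axis=0) for k in kept])
        converged = (len(new_centers) == len(centers)
                     and np.abs(new_centers - centers).max() < tol)
        centers = new_centers
        if converged:
            break
    return centers

# Toy image (hypothetical): white background with two colored "curves".
image = np.full((20, 20, 3), 255, dtype=np.uint8)
image[5, :, :] = [220, 30, 30]    # reddish row
image[12, :, :] = [20, 40, 200]   # bluish row
palette = seeded_kmeans_palette(image)
```

Because the starting centers are fixed rather than random, repeated runs on the same image return the same palette. The pairwise distance array grows with the number of pixels, so a large image would need to be processed in chunks or subsampled.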

Search for the best parameters for unsupervised clustering of curve profiles with DBSCAN
The DBSCAN method [10] has several parameters to adjust. The first, eps, is the maximum distance between two objects for them to be considered neighbors. The second, min_samples, is the minimum number of members (or total weight) in a neighborhood. The third is the metric used to calculate the distance matrix. The metric did not significantly influence our result, so we set it to Euclidean and focused on searching for the best values of eps and min_samples. We analyzed how the number of clusters and the number of unclustered curves (noise) depend on these parameters. When eps was less than one, only a small portion of the samples was clustered, and the noise cluster was large. Increasing eps initially increased the number of clusters and reduced the noise. However, a further increase in eps reduced the number of clusters, as the extracted groups started to merge. When eps was equal to five, there was no noise, and all entries were put in a single cluster. Overall, we determined that the best parameter values were eps = 2.6 and min_samples = 5, which produced 7 clusters and left half of the curves as noise. Notably, min_samples influenced the number of clusters much more strongly than the noise volume. Presumably, changing the minimum cluster size did not bring new samples into clusters but only refined the existing clustering. We map the clusters with the UMAP package [10] in Figure 3. Each dot corresponds to a curve; colors correspond to the cluster labels. Although the axes units are not interpretable, all clusters are well defined in this mapping.
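A parameter sweep of this kind can be sketched with Scikit-learn's `DBSCAN`. The example below is only an illustration: the synthetic two-blob data stand in for the curve feature vectors (whose exact representation is not specified here), and the helper name `sweep_dbscan` and the scanned values are assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def sweep_dbscan(X, eps_values, min_samples_values):
    """Return (eps, min_samples, n_clusters, n_noise) for each parameter pair."""
    results = []
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples,
                            metric="euclidean").fit_predict(X)
            # DBSCAN marks unclustered samples with the label -1.
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_noise = int(np.sum(labels == -1))
            results.append((eps, min_samples, n_clusters, n_noise))
    return results

# Synthetic stand-in data (not the paper's curves): two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 2)),
])
results = sweep_dbscan(X, eps_values=[0.01, 1.0, 20.0], min_samples_values=[5])
for eps, min_samples, n_clusters, n_noise in results:
    print(f"eps={eps:<5} min_samples={min_samples} "
          f"clusters={n_clusters} noise={n_noise}")
```

On such data the sweep reproduces the qualitative trend described above: a very small eps leaves every sample as noise, while a very large eps merges everything into a single cluster with no noise.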

Curves left out as noise by DBSCAN
Unsupervised clustering with the DBSCAN algorithm found groups of similar behavior among half of the records, labeling the other half as noise. We put the noise curves into a single class. Figure 4 demonstrates that the noise class curves have a variety of profiles.

Possible values for various keys in data records
The dataset of thermal emissivity records with metadata is represented as JSON files with various keys. Table 1 provides the possible values for the keys in the data records. For some keys, the content is not limited, and the value can list any number of descriptors. Other keys can have only one value out of a fixed set. Among the unlimited keys, "geometry" stores all keywords that the source paper uses to characterize the geometry and that we considered descriptive. Similarly, under the key "materials", we listed all materials used in the device. In contrast, we put a single value from the chosen set for "composition key" and "geometry key". The key "data type" can have one of two values, whereas the key "tool" lists all mentioned methods. The key "comment" is for any important notes; the key "info on image" stores information given in a figure as a text comment or in an inset. The key "color" provides the HEX color code of the cluster center found with the automated curve data extraction algorithm. The key "score" contains the value of a quality score, ranging from 0 to 1, estimated during technical validation.

Rinsing text off with OCR algorithms
Some figures have a text comment of the same color as the curve, and in these cases, one color channel contains both the curve and the comment. The comments vary widely in content: sometimes the comment is a standard English word, and sometimes it is a sequence of Greek letters or an equation with numbers and mathematical operators. In an attempt to remove the text, we used the EasyOCR [8] package, allowing any English or Latin letter as well as numbers and symbols. EasyOCR detected the text comments but often returned meaningless strings due to the high diversity of allowed symbols. Moreover, for curves with complicated behavior, EasyOCR mistakenly detected portions of the curve as text. Figure 5 shows an example of detection with incorrectly recognized text comments and with the letters "I" and "M" assigned to oscillating parts of the curve.

Decision tree for metadata analysis
To better understand the role of metadata for the curve classes, we trained a decision tree with the Scikit-learn library [10]. For the geometry, we used a numerical encoding: 0 for film, 1 for 1D grating, 2 for bull's eye, and 3 for 2D grating. The other geometry types were present only in the noise class. We applied a binary encoding for the composition: 0 for single material and 1 for sandwich. For the material list, we used one-hot encoding, with 1 if the material was present in the structure and 0 if it was absent. After training, the decision tree had an accuracy of 0.86 on five-fold cross-validation. We chose entropy as the splitting criterion because the goal was to determine which parameter had the primary influence on the splitting and yielded the most information gain. Figure 6 shows the obtained decision tree. Geometry and the choice of metals have a major impact on the emissivity curve profile. The decision tree first branches on the geometry, checking whether the design has 2D periodicity on the surface. This split yields a significant information gain toward the clustering, as the entropy decreases by one-third. With 2D periodicity (following the right arrow), the presence of tantalum or tungsten leads to the split into classes 4, 5, and 7. Class 6 is present throughout the tree, so we exclude it from consideration here. Following the left arrow, the tree goes deeper into the geometry details, checking whether the surface is flat or has a grating. Class 1 mainly has a "bull's eye" surface grating, while classes 2 and 3 are films with flat surfaces. This analysis corroborates the idea that geometry is the primary factor.
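How the entropy criterion scores a split can be shown with a small worked example. The records below are hypothetical toy values that reuse the geometry encoding described above; they are not taken from the dataset, and the helper names `entropy` and `information_gain` are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, mask):
    """Entropy reduction when labels are split by a boolean mask."""
    left = [y for y, m in zip(labels, mask) if not m]
    right = [y for y, m in zip(labels, mask) if m]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (left, right) if part)
    return entropy(labels) - weighted

# Toy records (hypothetical): geometry code and curve class per record.
# Geometry encoding: 0 film, 1 1D grating, 2 bull's eye, 3 2D grating.
geometry = [0, 0, 1, 2, 3, 3, 3, 3]
classes = [2, 3, 1, 1, 4, 5, 4, 5]

is_2d = [g == 3 for g in geometry]  # candidate split: 2D periodicity or not
parent_entropy = entropy(classes)
gain = information_gain(classes, is_2d)
```

A tree trained with the entropy criterion picks, at each node, the split with the largest such gain, so the feature chosen at the root is the one that separates the classes most strongly.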

Nine electronic paper scrapers
To check for the presence of the keywords in figure captions, we implemented nine electronic paper scrapers for nine publishers. We worked with HTML versions of full-text papers by means of regular expressions. First, we manually determined the unique HTML sequences that each publisher uses to encode a figure object and a figure caption; we list them in Table 2. We chose the HTML sequences such that the part of the electronic paper from the start sequence to the end sequence contained only the caption sentence and other HTML encoding tags. Next, in every HTML paper, we located all start and end sequences with regular expressions and matched them in pairs, combining each start sequence with the first following end sequence. Then, we extracted the part of the paper for every start-end pair. As none of the HTML encoding tags contains "emissivity", "emitter", or "emission", we used regular expressions to check these parts for the presence of the keywords without any further processing. The algorithm took 2 days to run on the database of 4.9 million papers obtained through a special publisher agreement.
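The pairing and keyword check can be sketched with Python's `re` module. The start and end sequences below are hypothetical placeholders, not the publisher-specific sequences from Table 2, and case-insensitive matching is an assumption.

```python
import re

# Keywords from the text; case-insensitive matching is an assumption.
KEYWORDS = re.compile(r"emissivity|emitter|emission", re.IGNORECASE)

def extract_captions(html, start_seq, end_seq):
    """Return the text between each start sequence and the first following end sequence."""
    # The non-greedy group stops at the first end sequence after each start.
    pattern = re.compile(
        re.escape(start_seq) + r"(.*?)" + re.escape(end_seq), re.DOTALL
    )
    return pattern.findall(html)

def captions_with_keywords(html, start_seq, end_seq):
    """Keep only the caption parts that mention one of the keywords."""
    return [c for c in extract_captions(html, start_seq, end_seq)
            if KEYWORDS.search(c)]

# Hypothetical publisher markup for illustration only.
START, END = '<figcaption class="caption">', "</figcaption>"
html = (
    '<figure><img src="f1.png"><figcaption class="caption">'
    "<b>Fig. 1</b> Measured thermal emissivity of the tungsten emitter."
    "</figcaption></figure>"
    '<figure><img src="f2.png"><figcaption class="caption">'
    "<b>Fig. 2</b> SEM image of the sample.</figcaption></figure>"
)
captions = extract_captions(html, START, END)
hits = captions_with_keywords(html, START, END)
```

The extracted parts still contain inline HTML tags such as `<b>`; as noted above, this is harmless as long as no tag contains one of the keywords.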